1.  INTRODUCTION

In TREC 2008, we participate in the Blog, Enterprise, and Relevance
Feedback tracks. In all tracks, we continue the research and
development of the Terrier platform¹, centred around extending
state-of-the-art weighting models based on the Divergence From
Randomness (DFR) framework [26]. In particular, we investigate
two main themes, namely, proximity-based models, and collection
and profile enrichment techniques based on several resources.

In the Blog track, we aim to improve our opinion detection
techniques and to integrate various new blog-specific features into
our Voting Model [18]. For the baseline ad-hoc task, we aim to build
strongly performing baselines by applying two different techniques.
The first one boosts documents in which query terms co-occur within
a given window size, and the second one applies query expansion
using collection enrichment. Non-English documents are also
removed from the retrieved results.

In the opinion-finding task, we experiment with two main opinion
detection approaches. The first one improves our TREC 2007
dictionary-based approach by automatically building an internal
opinion dictionary from the collection itself. We measure the
opinionated discriminability of each term using an
information-theoretic divergence measure based on the relevance
assessments of previous years. The second approach is based on the
OpinionFinder tool, which identifies subjective sentences in text.
In particular, we introduce a novel method to measure the
informativeness of query terms occurring in close proximity to
subjective sentences.

In  the  blog  distillation  task,  we  have  two  research  themes.  Firstly, 
we  aim  to  extend  our  Voting  Model  with  a  component  to  focus  on  a 
balanced  and  neutral  retrieval  that  does  not  favour  prolific  bloggers. 
The  Voting  Model  is  based  on  the  intuition  that  a  relevant  blogger 
will  post  repeatedly  around  a  topic  area.  By  treating  each  relevant 
post  as  a  vote  for  that  blog  to  be  relevant,  we  can  infer  a  ranking 
of  blogs.  This  approach  is  based  on  voting  techniques  inspired  by 
electoral  social  choice  theory  and  data  fusion  [18].  Neutrality  is 
an  important  concept  in  an  election  -  each  candidate  should  have 
an  equal  chance  of  getting  elected.  Similarly,  bloggers  should  have 
an  equal  chance  of  getting  retrieved  for  a  query,  regardless  of  how 
many  posts  they  have  made.  With  this  in  mind,  we  investigate  the 
application  of  normalisation  techniques  in  this  task. 

In our participation in the first Relevance Feedback track, we
aim to develop new techniques on top of our DFR query expansion
models. First, we expand the query by measuring the divergence
of a term's distribution in a relevance set from its distribution in
the whole collection using the Kullback-Leibler (KL) divergence
measure. This relevance set can be either pseudo, which consists of
the top returned documents, or explicit, which consists of the judged
relevant documents as in the provided feedback sets B to E.

¹ Information on Terrier can be found at:
http://ir.dcs.gla.ac.uk/terrier/

Second, we expand the query from surrogates of the documents,
instead of all their text content. These document surrogates are
created by a low-cost, syntactically-based information processing
model, which uses surface-syntactic evidence to automatically
identify informative content and to reduce the noise from any textual
input. We use the surface-syntactic approach to prune the feedback
documents before selecting the query expansion terms, allowing
(noisy) terms carried in unusual syntactic structures to be ignored.

In  the  Enterprise  track,  we  participate  in  both  the  document  and 
the  expert  search  tasks.  In  keeping  with  one  of  the  central  themes 
for  our  TREC  2008  participation,  we  investigate  the  application  of 
suitable  external  evidence  in  both  tasks. 

In the document search task, we investigate how external
resources can be used to enhance the retrieval performance through
a collection enrichment approach. Our external resources are
obtained from the top-ranked results produced by different
commercial search engines. Furthermore, we test how the selective
application of collection enrichment can provide further improvements.

In the expert search task, we have two aims: firstly, to extend
and further test our novel Voting Model and, secondly, to use
external evidence of expertise. The Voting Model takes into account
various sources of evidence of candidates' expertise, by examining
the ranking of documents with respect to the query, and inferring
votes for candidates to be relevant. The model is expanded to take
into account several input rankings of documents. In particular,
we use document rankings produced by Web search engines as
external evidence of the expertise of the candidates.

The remainder of this paper is structured as follows. Section 2
describes the DFR models we use in TREC 2008. Section 3 details
our indexing specifications. Section 4 describes our approaches in
each of the Blog track post retrieval tasks along with their
corresponding evaluation. Section 5 details our participation in the
Enterprise document search task, while Section 6 discusses the
results of our first participation in the Relevance Feedback track.
Sections 7 and 8 cover our participation in the Enterprise track
expert search task and in the Blog track blog distillation task,
respectively. Finally, Section 9 presents our final remarks.
2.  MODELS

Following from previous years, our research in Terrier centres on
extending the Divergence From Randomness (DFR) framework [1].
In Section 2.1, we present the existing DFR weighting models we
experimented with in TREC 2008, while, in Section 2.2, we present
our existing DFR model that captures term dependence and
proximity. Section 2.3 presents the Bo1 and KL DFR term weighting
models for query expansion.


2.1  Divergence  From  Randomness  Weighting 
Models 

Document structure (i.e. fields, such as titles) has been shown to
be useful when ranking documents. A field-based weighting model
is one that considers the occurrences of query terms in different
fields. Robertson et al. [31] observed that the linear combination of
scores, which has been the approach mostly used for the combination
of fields, is difficult to interpret due to the non-linear relation
between the scores and the term frequencies in each of the fields.
In addition, Hawking et al. [10] showed that the length
normalisation that should be applied to each field depends on the
nature of the field. Zaragoza et al. [36] introduced a field-based
version of BM25, called BM25F, which applies length normalisation
and weighting of the fields independently. Macdonald et al. [25] also
introduced Normalisation 2F in the DFR framework for performing
independent term frequency normalisation and weighting of fields.

In this work, we use a field-based model from the DFR framework,
namely PL2F. Using the PL2F model, the relevance score of
a document d for a query Q is given by:

score(d, Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn + 1} \Big( tfn \cdot \log_2 \frac{tfn}{\lambda}
              + (\lambda - tfn) \cdot \log_2 e + 0.5 \cdot \log_2 (2\pi \cdot tfn) \Big)    (1)

where \lambda is the mean and variance of a Poisson distribution, given
by \lambda = F/N. F is the frequency of the query term t in the whole
collection, and N is the number of documents in the whole
collection. The query term weight qtw is given by qtf/qtf_max, where
qtf is the query term frequency and qtf_max is the maximum query
term frequency among all query terms. In PL2F, tfn corresponds to the
weighted sum of the normalised term frequencies tf_f for each used
field f, a technique known as Normalisation 2F [25]:

tfn = \sum_{f} \Big( w_f \cdot tf_f \cdot \log_2 \big( 1 + c_f \cdot \frac{avg\_l_f}{l_f} \big) \Big), \quad (c_f > 0)    (2)

where tf_f is the frequency of term t in field f of document d,
l_f is the length in tokens of field f in document d, and avg_l_f is
the average length of the field across all documents. c_f is a
hyper-parameter for each field, which controls the term frequency
normalisation; the importance of the term occurring in field f is
controlled by the weight w_f.

Note that the classical DFR weighting model PL2 can be generated
by using Normalisation 2 instead of Normalisation 2F for tfn
in Equation (1) above. Normalisation 2 is given by:

tfn = tf \cdot \log_2 \big( 1 + c \cdot \frac{avg\_l}{l} \big), \quad (c > 0)    (3)

where tf is the frequency of term t in the document d, l is the
length of the document in tokens, and avg_l is the average length
of all documents. c is a hyper-parameter that controls the term
frequency normalisation with respect to the document length.
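
As an illustration, the following Python sketch computes Normalisation 2, Normalisation 2F, and the PL2 weight of a single query term, assuming the standard PL2 form with a Poisson mean of F/N. It is not Terrier's implementation: the function names, the dictionary-based field layout, and the default c = 1.0 are assumptions made for the example.

```python
import math


def normalisation2(tf, doc_len, avg_doc_len, c=1.0):
    """Normalisation 2: tfn = tf * log2(1 + c * avg_l / l), with c > 0."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def normalisation2f(field_tf, field_len, avg_field_len, c, w):
    """Normalisation 2F: weighted sum of per-field normalised frequencies.

    All arguments are dicts keyed by a (hypothetical) field name.
    """
    tfn = 0.0
    for f, tf in field_tf.items():
        # Fields in which the term does not occur contribute nothing.
        if tf > 0 and field_len[f] > 0:
            tfn += w[f] * tf * math.log2(1.0 + c[f] * avg_field_len[f] / field_len[f])
    return tfn


def pl2_term_score(tfn, F, N, qtw=1.0):
    """PL2 weight of one query term, given its normalised frequency tfn."""
    if tfn <= 0:
        return 0.0
    lam = F / N  # mean (and variance) of the Poisson distribution
    return qtw / (tfn + 1.0) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * tfn)
    )
```

Passing the output of `normalisation2f` to `pl2_term_score` gives a PL2F-style term weight, while passing the output of `normalisation2` gives the classical PL2 weight; the document score is the sum of these weights over the query terms.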

Another weighting model used in our participation in TREC is
the InLB model, which is applied in the Blog track opinion-finding
task. This model applies the Inverse Document Frequency and
Laplace succession for document weighting [1], as well as BM25's
term-frequency normalisation function [32]. In InLB, for a given
document d and query Q, the relevance score is given by:

score(d, Q) = \sum_{t \in Q} w(d, t) = \sum_{t \in Q} \frac{qtw}{tfn + 1} \Big( tfn \cdot \log_2 \frac{N + 1}{df + 0.5} \Big)    (4)

where the query term weight qtw is given by qtf/qtf_max, qtf is
the query term frequency, and qtf_max is the maximum query term
frequency among all query terms. N is the number of documents
in the collection, and df is the number of documents containing
the query term t. The normalised term frequency tfn is given by
BM25's normalisation function [32] as follows:

tfn = \frac{tf}{(1 - b) + b \cdot \frac{l}{avg\_l}}    (5)

where tf is the within-document term frequency, l is the document
length, and avg_l is the average document length in the whole
collection. b is a free parameter. In this paper, we set b to 0.2337
after training on the 100 topics from the TREC 2006 and 2007 Blog
track opinion-finding tasks, numbered 851 to 950.
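
The following Python sketch illustrates the two steps of InLB under stated assumptions: BM25's term-frequency normalisation with b = 0.2337, followed by an inverse document frequency weight with a Laplace after-effect of 1/(tfn + 1). The function names are hypothetical, and the exact arrangement of the InLB terms is an assumption based on the description above.

```python
import math


def bm25_tfn(tf, doc_len, avg_doc_len, b=0.2337):
    """BM25 normalisation: tfn = tf / ((1 - b) + b * l / avg_l)."""
    return tf / ((1.0 - b) + b * doc_len / avg_doc_len)


def inlb_term_score(tf, doc_len, avg_doc_len, df, N, qtw=1.0, b=0.2337):
    """Assumed InLB term weight: qtw / (tfn + 1) * tfn * log2((N + 1) / (df + 0.5))."""
    tfn = bm25_tfn(tf, doc_len, avg_doc_len, b)
    return qtw / (tfn + 1.0) * tfn * math.log2((N + 1.0) / (df + 0.5))
```

With this normalisation, a document of average length leaves tf unchanged, while longer documents have their term frequencies discounted in proportion to b.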

The last weighting model used in this work is the DPH model,
also derived from the DFR framework. Using DPH, the relevance
score of a document d for a query Q is given by [2]:

score(d, Q) = \sum_{t \in Q} qtw \cdot \frac{(1 - F)^2}{tf + 1} \cdot \Big( tf \cdot \log_2 \big( tf \cdot \frac{avg\_l}{l} \cdot \frac{N}{TF} \big)
              + 0.5 \cdot \log_2 (2\pi \cdot tf \cdot (1 - F)) \Big)    (6)

where F is given by tf/l, tf is the within-document frequency, and
l is the document length in tokens. avg_l is the average document
length in the collection, N is the number of documents in the
collection, and TF is the term frequency in the collection. qtw is the
query term weight and is given by qtf/qtf_max, where qtf is the
query term frequency and qtf_max is the maximum query term
frequency among all query terms. Note that DPH is a parameter-free
model: all variables in its formula can be directly obtained from the
collection statistics. No parameter tuning is required to optimise
DPH, so we can instead focus on studying query expansion.
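
A minimal Python sketch of the DPH weight of a single query term is given below, assuming the usual form of the model in which the weight is normalised by (1 - F)^2 / (tf + 1). The function name and the guard for degenerate documents are assumptions added for the example.

```python
import math


def dph_term_score(tf, doc_len, avg_doc_len, N, TF, qtw=1.0):
    """DPH weight of one query term; every input is a collection statistic."""
    if tf <= 0:
        return 0.0
    F = tf / doc_len  # relative within-document frequency
    if F >= 1.0:
        # Assumption: a document made up only of this term gets no weight,
        # since (1 - F) is zero and its logarithm is undefined.
        return 0.0
    norm = (1.0 - F) ** 2 / (tf + 1.0)
    return qtw * norm * (
        tf * math.log2((tf * avg_doc_len / doc_len) * (N / TF))
        + 0.5 * math.log2(2.0 * math.pi * tf * (1.0 - F))
    )
```

The function takes no tunable hyper-parameter, which reflects the parameter-free nature of the model.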


2.2  Terms Dependence in the Divergence From Randomness Framework

We believe that taking into account the dependence and proximity
of query terms in documents can increase the retrieval
effectiveness. To this end, we extend the DFR framework with models
for capturing the dependence of query terms in documents.
Following [3], the models are based on the occurrences of pairs of
query terms that appear within a given number of terms of each other
in the document. The introduced weighting models assign scores to
pairs of query terms, in addition to the single query terms. The
score of a document d for a query Q is given as follows:

score(d, Q) = \sum_{t \in Q} qtw \cdot score(d, t) + \sum_{p \in Q_2} score(d, p)    (7)

where score(d, t) is the score assigned to a query term t in the
document d, p corresponds to a pair of query terms, and Q_2 is
the set that contains all possible combinations of two query terms.
In Equation (7), \sum_{t \in Q} qtw \cdot score(d, t) can be estimated
by any DFR weighting model, with or without fields. The score(d, p)
of a pair of query terms in a document is computed as follows:

score(d, p) = -\log_2 (P_{p1}) \cdot (1 - P_{p2})    (8)

where P_{p1} corresponds to the probability that there is a document
in which a pair of query terms p occurs a given number of times.
P_{p1} can be computed with any randomness model from the DFR
framework, such as the Poisson approximation to the Binomial
distribution. P_{p2} corresponds to the probability of seeing the
query term pair once more, after having seen it a given number of
times. P_{p2} can be computed using any of the after-effect models in
the DFR framework. The difference between score(d, p) and score(d, t)
is that the former depends on counts of occurrences of the pair of
query terms p, while the latter depends on counts of occurrences of
the query term t.
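
The pair statistics that these models rely on can be sketched in Python as follows. The sketch enumerates the combinations of two query terms and counts the sliding windows of a document in which both terms of a pair occur. The function names, the token-list representation of a document, and the treatment of documents shorter than the window are assumptions made for the example.

```python
from itertools import combinations


def query_pairs(query_terms):
    """All combinations of two distinct query terms (the set of pairs)."""
    unique_terms = list(dict.fromkeys(query_terms))  # drop repeats, keep order
    return list(combinations(unique_terms, 2))


def count_pair_windows(tokens, term_a, term_b, ws):
    """Number of sliding windows of ws tokens that contain both terms."""
    # Assumption: a document shorter than ws has no complete window.
    if ws < 2 or len(tokens) < ws:
        return 0
    count = 0
    for start in range(len(tokens) - ws + 1):
        window = tokens[start:start + ws]
        if term_a in window and term_b in window:
            count += 1
    return count
```

For example, in the token sequence `a b c d a`, the pair (a, d) occurs in one window of size 2 and in two windows of size 4.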

This year, we applied the pBiL2 randomness model [17], which
does not consider the collection frequency of pairs of query terms.
It is based on the binomial randomness model, and computes the
score of a pair of query terms in a document as follows:

score(d, p) = \frac{1}{pfn + 1} \cdot \Big( -\log_2 (avg\_w - 1)! + \log_2 pfn!
              + \log_2 (avg\_w - 1 - pfn)!
              - pfn \cdot \log_2 (p_p)
              - (avg\_w - 1 - pfn) \cdot \log_2 (p'_p) \Big)    (9)

where avg_w = (T - N(ws - 1))/N is the average number of windows of
size ws tokens in each document in the collection, N is the number
of documents and T is the total number of tokens in the collection.
p_p = 1/(avg_w - 1), p'_p = 1 - p_p, and pfn is the normalised
frequency of the tuple p, as given by Normalisation 2:
pfn = pf \cdot \log_2 (1 + c_p \cdot \frac{avg\_w - 1}{l - ws}). In
Normalisation 2, pf is the number of windows of size ws in document d
in which the tuple p occurs, l is the length of the document in
tokens, and c_p > 0 is a hyper-parameter that controls the
normalisation applied to the pf frequency with respect to the number
of windows in the document.


2.3  Term Weighting Models for Query Expansion

Terrier implements a list of DFR-based term weighting models
for query expansion. The basic idea of these term weighting
models is to measure the divergence of a term's distribution in a
pseudo-relevance set from its distribution in the whole collection.
The higher this divergence is, the more likely the term is related to
the query topic. Among the term weighting models implemented
in Terrier, Bo1 is one of the best-performing ones [1].

The Bo1 term weighting model is based on the Bose-Einstein
statistics. Using this model, the weight of a term t in the exp_doc
top-ranked documents is given by:

w(t) = tfx \cdot \log_2 \frac{1 + P_n}{P_n} + \log_2 (1 + P_n)    (10)

where exp_doc usually ranges from 3 to 10 [1]. Another parameter
involved in the query expansion mechanism is exp_term, the number
of terms extracted from the exp_doc top-ranked documents.
exp_term is usually larger than exp_doc [1]. P_n is given by F/N,
where F is the frequency of the term t in the collection, and N is
the number of documents in the collection. tfx is the frequency of
the query term t in the exp_doc top-ranked documents.

Terrier employs a parameter-free function to determine the query
term weight qtw (see Equation (1)), which is given as follows:

qtw = \frac{qtf}{qtf_{max}} + \frac{w(t)}{\lim_{F \to tfx} w(t)}

\lim_{F \to tfx} w(t) = F_{max} \cdot \log_2 \frac{1 + P_{n,max}}{P_{n,max}} + \log_2 (1 + P_{n,max})    (11)

where \lim_{F \to tfx} w(t) is the upper bound of w(t). P_{n,max} is
given by F_max/N. F_max is the frequency F of the term with the
maximum w(t) in the top-ranked documents. If a query term does not
appear among the most informative terms from the top-ranked
documents, its query term weight remains equal to the original one.

Another term weighting model employed by Terrier is based on
the KL divergence measure. Using the KL model, the weight of a
term t in the feedback document set D is given by [1]:

w(t) = p(t|D) \cdot \log_2 \frac{p(t|D)}{p(t|Coll)}    (12)

where p(t|D) = tfx/c(D) is the probability of observing the term
t in the feedback document set D, tfx is the frequency of the
term t in the set D and c(D) is the number of tokens in this set.
p(t|Coll) = TF/c(Coll) is the probability of observing the term
t in the whole collection, TF is the frequency of t in the
collection, and c(Coll) is the number of tokens in the collection. In
our experiments, the feedback document set contains the exp_doc
top-ranked documents, from which the exp_term most weighted terms
by KL are then extracted.

Using KL, the query term weight qtw is also determined by
Equation (11), while the upper bound of w(t) is given by:

\lim_{F \to tfx} w(t) = \frac{F_{max}}{l_x} \cdot \log_2 \frac{c(Coll)}{l_x}    (13)

where F_max is the collection frequency F of the term with the
maximum w(t) in the top-ranked documents, l_x is the length of
the feedback documents, and c(Coll) is the number of tokens in the
whole collection. Note that the DFR query expansion framework
is similar to Rocchio's relevance feedback method [33]. The
difference is that the former considers the whole feedback document
set as a bag of words, while the latter averages term weights over
single feedback documents.
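
To make the term weighting step concrete, the following Python sketch computes the Bo1 and KL weights of candidate expansion terms and keeps the most highly weighted ones. It treats the feedback documents as a single bag of words, as described above. The function names, the token-list representation of documents, and the `coll_tf` dictionary of collection frequencies are assumptions made for the example; the reweighting of query terms by the upper bound of w(t) is not shown.

```python
import math
from collections import Counter


def bo1_weight(tfx, F, N):
    """Bo1: w(t) = tfx * log2((1 + P_n) / P_n) + log2(1 + P_n), with P_n = F / N."""
    p_n = F / N
    return tfx * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)


def kl_weight(tfx, feedback_tokens, TF, collection_tokens):
    """KL: w(t) = p(t|D) * log2(p(t|D) / p(t|Coll))."""
    p_d = tfx / feedback_tokens
    p_coll = TF / collection_tokens
    return p_d * math.log2(p_d / p_coll)


def expansion_terms(feedback_docs, coll_tf, collection_tokens, exp_term=10):
    """Rank the terms of the feedback documents by KL weight and keep exp_term."""
    # The feedback set is pooled into one bag of words.
    counts = Counter(tok for doc in feedback_docs for tok in doc)
    total = sum(counts.values())
    weights = {
        t: kl_weight(tfx, total, coll_tf[t], collection_tokens)
        for t, tfx in counts.items()
    }
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:exp_term]
```

A term that is as frequent in the feedback set as in the whole collection receives a KL weight of zero, so only terms that are over-represented in the feedback documents are promoted.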


3.  INDEXING 

In TREC 2008, we participate in the Blog, Enterprise, and Relevance
Feedback tracks. The test collection for the Blog track is
the TREC Blogs06 collection [23], which is a crawl of 100k blogs
over an 11-week period. During this time, the blog posts
(permalinks), feeds (RSS, XML, etc.) and homepages of each blog were
collected. In our participation in the Blog track, we index only the
permalinks component of the collection. There are approximately
3.2 million documents in the permalinks component. For the Relevance
Feedback track, the test collection is the large-scale .GOV2
collection, which has an uncompressed size of 426GB. For indexing
purposes, we treat the above two collections in the same way. Using
the Terrier IR platform [26], we create content-based indices,
including the document content and the titles.

For the Enterprise track, we use the CSIRO Enterprise Research
Collection (CERC), which is a crawl of the csiro.au domain
(370k documents). CSIRO is a real Enterprise-sized organisation.
To support the field-based weighting models, we index separate
fields of the documents, namely the content, the title, and the
anchor text of the incoming hyperlinks.

For  all  the  three  collections,  each  term  is  stemmed  using  Porter’s 
English  stemmer,  and  normal  English  stopwords  are  removed. 


4.  BLOG TRACK: BLOG POST RETRIEVAL TASKS

Following the TREC guidelines, in the Blog post retrieval tasks,
namely, baseline ad-hoc, opinion-finding, and polarity tasks, we
submit runs based on all 150 topics developed so far. In this section,
however, unless otherwise stated, we report our results on the new
topics only, i.e., topics 1001-1050. Additionally, all our runs use
only the title of the topics.

4.1  Baseline  Ad-hoc  Retrieval  Task 

Following the Blog track guidelines, in the baseline ad-hoc
retrieval task, we submit two runs solely aimed at retrieving
topic-relevant documents, i.e., with no opinion feature enabled. Our
first baseline, uogBLProx, applies the InLB document weighting model
and the pBiL2 term proximity model, as described in Equations (4)
and (9), respectively. In addition, we remove non-English blog
posts from the returned results, as language filtering was shown to
be beneficial in previous Blog tracks [24, 27]. On top of
uogBLProx, our second baseline, uogBLProxCE, applies the Bo1 term
weighting model for query expansion on an external collection,
namely, the Aquaint2 collection, a timely news resource.

Table 1 summarises the retrieval performance of our two submitted
baseline runs in terms of both topic-relevance (rel) and
opinion-finding (op). The median of the participating groups in this
task for the 2008 topics, 1001-1050, is also shown. From the table, we
can see that our baselines markedly outperform the TREC median in
terms of MAP, for both topic-relevance and opinion-finding. In terms
of P@10, they outperform the median in all cases except for the
topic-relevance P@10 of uogBLProx (0.6840 against 0.6960).


Run            MAP_rel   P@10_rel   MAP_op   P@10_op
TREC median    0.3529    0.6960     0.2890   0.5700
uogBLProx      0.4141    0.6840     0.3464   0.5820
uogBLProxCE    0.4219    0.7060     0.3531   0.6100

Table 1: Results of submitted runs in the baseline ad-hoc
retrieval task for topics 1001-1050.

4.2  Opinion-Finding  Task 

In the opinion-finding task, we experiment with two main
approaches for detecting opinionated documents. The first approach
improves our TREC 2007 dictionary-based approach by
automatically building an internal opinion dictionary from the
collection itself. The second approach is based on the OpinionFinder
tool, which identifies subjective sentences in text. In particular, we
introduce a novel method to measure the informativeness of query
terms occurring in close proximity to subjective sentences.

In our first opinion detection approach [12], a dictionary of
subjective terms is automatically derived from the target collection
without requiring any manual effort. In particular, from the list
of all terms in the collection ranked by their within-collection
frequency in descending order, a skewed query model is applied to
filter out those that are too frequent or too rare [5]. This aims to
remove terms that carry too little or too specific information, and
which thus cannot be interpreted as general opinion indicators for
different queries. Using the Bo1 term weighting model (Equation (10))
and a training set comprising the 100 topics for the TREC 2006
and 2007 Blog track opinion-finding tasks, 851-950, the remaining
terms from the list are weighted based on the divergence of their
distribution in the set D(opRel) of relevant and opinionated
documents retrieved for these topics against that in the set D(rel) of
relevant documents retrieved for the same set of topics.

To compare with the dictionary derived from the collection
itself, we also manually generate a dictionary compiled from various
linguistic resources such as OpinionFinder [35]. This dictionary
contains around 12,000 English words, mostly adjectives, adverbs
and nouns, which are supposed to be subjective. In this paper, we
denote the manually edited dictionary by the external dictionary,
and the automatically derived one by the internal dictionary.


We submit the 100 highly weighted terms from either dictionary
as a query Q_opn and assign an opinion score to the retrieved
documents using the InLB DFR weighting model (Equation (4)). For
each retrieved document d for a given new query Q, we combine
its topic-relevance score (given by a retrieval baseline, which is
independent of any expressed opinion in the document) with its
opinion score to produce the final document ranking. We have
experimented with two combination methods. The first method
applies the following linear combination:

score_{com}(d, Q) = (1 - a) \cdot score(d, Q_{opn}) + a \cdot score(d, Q)    (14)

where score(d, Q_opn) and score(d, Q) are scaled by dividing them
by the maximum score(d, Q_opn) and score(d, Q), respectively. a
is the free parameter of the linear combination, set after training on
the 100 topics from 2006 and 2007.

Our second combination method maps each opinion score to the
maximum likelihood of the probability P(opn|d, Q_opn) of being
opinionated as follows:

P(opn|d, Q_{opn}) = \frac{score(d, Q_{opn})}{\sum_{d \in Coll} score(d, Q_{opn})}    (15)

where Coll is the entire document collection. Since a high
probability is supposed to indicate a high degree of opinion expressed
in the document, we would like to have a combined score that is an
increasing function of P(opn|d, Q_opn). Therefore, such a
probability P(opn|d, Q_opn) is combined with the initial relevance
score using a logarithmic function as follows:

score_{com}(d, Q) = \frac{-k}{\log_2 P(opn|d, Q_{opn})} + score(d, Q)    (16)

where k is a free parameter, also set by training.

Both  score  combination  methods  use  the  stored  opinion  scores 
of  all  documents,  computed  during  indexing.  Therefore,  there  is 
only  a  negligible  additional  overhead  during  retrieval. 
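
The two combination methods can be sketched in Python as follows, assuming that the relevance and opinion scores are held in dictionaries keyed by document identifier. The function names and the default values of a and k are hypothetical; in our experiments these parameters are set by training.

```python
import math


def linear_combination(rel_scores, opn_scores, a=0.5):
    """Linear combination of max-scaled opinion and relevance scores."""
    max_rel = max(rel_scores.values())
    max_opn = max(opn_scores.values())
    return {
        d: (1.0 - a) * opn_scores.get(d, 0.0) / max_opn + a * rel / max_rel
        for d, rel in rel_scores.items()
    }


def log_combination(rel_scores, opn_scores, k=1.0):
    """Logarithmic combination: relevance score plus -k / log2 P(opn|d, Q_opn)."""
    total = sum(opn_scores.values())  # sum over the whole collection
    combined = {}
    for d, rel in rel_scores.items():
        p = opn_scores.get(d, 0.0) / total
        # Assumption: a document without a usable probability (p = 0 or p = 1)
        # keeps its relevance score unchanged.
        bonus = -k / math.log2(p) if 0.0 < p < 1.0 else 0.0
        combined[d] = rel + bonus
    return combined
```

Here `opn_scores` covers the whole collection, while `rel_scores` covers only the documents retrieved for the query, so both functions reduce to dictionary lookups at retrieval time.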

Our second approach in this task [34] uses OpinionFinder [35],
a Natural Language Processing-based subjectivity analysis system,
to classify the subjectiveness of every sentence in the Blogs06
collection. After the whole collection is parsed, we index it by
considering the sentence tags generated by OpinionFinder as special
position markers, so that we can record the positions of every index
term with respect to the sentences in which it occurs within a given
document. We then boost the scores of the retrieved documents, as
given by the InLB weighting model (Equation (4)), based on the
proximity between the query terms and the subjective sentences
identified in each of these documents according to the equation:

score_{com}(d, Q) = (1 - \beta) \cdot \sum_{p \in Q \times S} score(d, p) + \beta \cdot score(d, Q)    (17)

where the pair p = (t, s) comprises a query term t from the query
Q and a subjective sentence s from the set S of all subjective
sentences identified in document d. In order to compute the proximity
score score(d, p), we apply the pBiL randomness model [17], as
given by Equation (9), except that the proximity windows are
measured in terms of number of sentences instead of number of tokens
and Normalisation 2 is not applied, since it did not show significant
improvements in our experiments. The linear combination
parameter \beta is trained on the 2006 and 2007 topics.


4.2.1  Experiments

In 2008, the TREC Blog track organisers provided five strongly
performing, yet statistically different baselines. Each of these
comprises a list of retrieved documents produced by a "black box"
search engine that retrieves as many topic-relevant documents as
possible without applying any specific opinion-finding feature. On
top of each of our two baselines, namely, uogBLProx and
uogBLProxCE, and the 5 standard baselines provided by the Blog track
organisers for TREC 2008, we submitted four runs as follows. The
first run applies our dictionary-based approach using either the
internal or the external opinion dictionary. The second run applies
our TREC 2007 OpinionFinder-based approach [11], while the third
run integrates the new proximity of query terms to subjective
sentences approach. The last run combines the proximity to subjective
sentences feature with our dictionary-based approach. Table 2
describes the nomenclature used by our submitted runs.


Code     Techniques
OPb1     Run based on baseline uogBLProx
OPb2     Run based on baseline uogBLProxCE
OP1-5    Runs based on standard baselines 1-5
ext      Dictionary-based approach (external dictionary)
int      Dictionary-based approach (internal dictionary)
of       OpinionFinder-based approach
Pr       Proximity to OpinionFinder's classified subjective sentences
L        Logarithmic combination
l        Linear combination

Table 2: Techniques applied in the submitted runs in the Blog
track opinion-finding task.

Table 3 summarises the retrieval performance of our submitted
runs in terms of topic-relevance (rel) and opinion-finding (op) over
the 7 baselines. In this table, an asterisk (*) indicates a significant
difference (p < 0.05) from the corresponding baseline run according
to the Wilcoxon matched-pairs signed-ranks test. We find that
all our 4 approaches provide statistically significant improvement
over 5 out of the 7 baselines. As for the remaining baselines, our
proximity-based approaches significantly improve over baseline 2,
while the approaches that do not employ proximity significantly
improve over baseline 4. As a whole, these results show that both
of our approaches are effective in finding opinionated documents.

In order to further investigate the robustness of our approaches,
Table 4 shows the opinion MAP performance of each of our runs
over each of the 5 standard baselines, as well as the average
performance of these runs across the baselines. For the sake of
legibility, the 4 approaches deployed over each baseline are generically
referred to as Dict, OpinionFinder, ProxSent, and Dict+ProxSent.
For each baseline, we also show the median performance of the 21
TREC runs that were deployed over all the standard baselines. The
median of the average improvements of these runs is also shown.
From the table, we can see that, besides improving over the
baselines, our approaches outperform the TREC medians in most
settings. Moreover, on average, all of our techniques provide
improvements across the 5 baselines, which further attests to their
robustness.
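As an illustration of how the average improvement figures can be read, the following minimal Python sketch assumes that the "improvement" columns of Table 4 are the mean and the sample standard deviation of the per-baseline relative changes in opinion MAP. This is an assumption rather than a stated definition, although it reproduces the +Dict row to two decimal places (about +3.18% and 3.38%). The function and variable names are illustrative only.

```python
from statistics import mean, stdev


def relative_improvements(baseline_scores, run_scores):
    """Percentage change of each run score over its corresponding baseline."""
    return [100.0 * (run - base) / base
            for base, run in zip(baseline_scores, run_scores)]


# Opinion MAP values copied from Table 4 (standard baselines 1-5).
baselines = [0.3239, 0.2639, 0.3564, 0.3822, 0.2988]
dict_runs = [0.3512, 0.2621, 0.3669, 0.3964, 0.3033]

gains = relative_improvements(baselines, dict_runs)
mean_gain = mean(gains)   # average improvement across the 5 baselines
spread = stdev(gains)     # sample standard deviation (n - 1 denominator)
print(f"mean = {mean_gain:+.2f}%, stdev = {spread:.2f}%")
```

A small standard deviation relative to the mean indicates that an approach behaves similarly regardless of the underlying baseline.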

Additionally, Table 5 shows the performance results of our best
opinion-finding runs for each of the 7 baselines in terms of
topic-relevance and opinion-finding MAP across the topics for each of
the 3 years the opinion-finding task was run, as well as for the
combined topics for the 3 years. From Table 5, we can observe that, for
baselines 1, 3, and 4, our opinion-finding performance increases
over the three years. Interestingly, in terms of topic-relevance
performance, the best results are observed for the TREC 2007 topics
across all baselines, which suggests that this set of topics was
relatively easier than the 2006 and 2008 topics.

Run             MAP_rel   P@10_rel  MAP_op    P@10_op
uogBLProx       0.4141    0.6840    0.3464    0.5820
uogOPb1intL     0.4218    0.7260*   0.3607*   0.6220*
uogOPb1ofL      0.4281*   0.7240*   0.3665*   0.6360*
uogOPb1Pr       0.4149    0.7040    0.3629*   0.6300
uogOPb1PrintL   0.4142    0.7040    0.3636*   0.6300

uogBLProxCE     0.4219    0.7060    0.3531    0.6100
uogOPb2intl     0.4237    0.7240    0.3617*   0.6340
uogOPb2ofL      0.4342*   0.7340*   0.3709*   0.6380*
uogOPb2Pr       0.4116    0.7040    0.3597*   0.6200
uogOPb2PrintL   0.4115    0.7100    0.3604*   0.6260

baseline 1      0.4032    0.7320    0.3239    0.5800
uogOP1intL      0.4174*   0.7440    0.3512*   0.6380*
uogOP1ofL       0.4073*   0.7460    0.3526*   0.6460*
uogOP1Pr        0.4115*   0.6880    0.3529*   0.6040
uogOP1PrintL    0.4120*   0.6880    0.3564*   0.5980

baseline 2      0.3107    0.6480    0.2639    0.5500
uogOP2intL      0.3029    -         0.2621    0.5660
uogOP2ofL       0.3123    -         0.2712    0.5540
uogOP2Pr        0.3048    -         0.2692*   0.5420
uogOP2PrintL    0.3045    -         0.2692*   0.5380

baseline 3      0.4343    0.6440    0.3564    0.5540
uogOP3intL      0.4391*   0.7280*   0.3669*   0.6340*
uogOP3ofL       0.4419*   0.6980    0.3728*   0.6060*
uogOP3Pr        0.4315    0.7000    0.3685*   0.6200
uogOP3PrintL    0.4302    0.7020    0.3704*   0.6180

baseline 4      0.4724    0.7440    0.3822    0.6160
uogOP4intL      0.4750    0.7520    0.3964*   0.6400
uogOP4ofL       0.4710    0.7640    0.3963*   0.6600*
uogOP4Pr        0.4431    0.6940    0.3752    0.5980
uogOP4PrintL    0.4397    0.6920    0.3753    0.6040

baseline 5      0.3745    0.7040    0.2988    0.5300
uogOP5extL      0.3713*   0.6900    0.3033*   0.5640*
uogOP5ofL       0.3777*   0.7020    0.3098*   0.5660*
uogOP5Pr        0.3894*   0.7040    0.3312*   0.6160*
uogOP5PrintL    0.3915*   0.7140    0.3345*   0.6240*

Table 3: Results of submitted runs in the opinion-finding task
over 7 different baselines for topics 1001-1050. An asterisk
(*) indicates a significant difference (p < 0.05) from the
corresponding baseline run according to the Wilcoxon matched-pairs
signed-ranks test.

4.3 Polarity Task

In the polarity task, we apply our dictionary-based approach once
more, with the exception that the dictionaries used for retrieving
positive and negative blog posts are respectively extracted from the
strong positive and negative words in OpinionFinder’s dictionary.
The negative dictionary comprises a total of 2,547 words while the
positive one comprises 1,153 words in total.

Table 6 shows the performance of our polarity runs in terms
of retrieving negative and positive opinionated documents across all
baselines. Overall, none of our approaches differs significantly from
its respective baseline according to the Wilcoxon matched-pairs
signed-ranks test with p < 0.05. In fact, we can observe minor
improvements for retrieving negative documents and a slight degradation
for retrieving positive documents.

Analogously to the analysis conducted in the previous section,
Table 7 shows the negative and positive MAP performance of our
deployed approach for the polarity task (DictOF, in the table) over
each of the 5 standard baselines, as well as the average performance
of these runs across the baselines. For each baseline, the median
performance of all 10 TREC runs that were deployed over all 5
baselines is also shown. From Table 7, we can see that, although
it decreases the baseline performance on average, our approach
stands well above the median. These very low median values,
in turn, attest to the difficulty of this task.

                baseline1  baseline2  baseline3  baseline4  baseline5  improvement
                                                                       mean     stdev
Baseline        0.3239     0.2639     0.3564     0.3822     0.2988
+Dict           0.3512*    0.2621     0.3669*    0.3964*    0.3033*    +3.18%   3.38%
+OpinionFinder  0.3526*    0.2712     0.3728*    0.3963*    0.3098*    +4.72%   2.40%
+ProxSent       0.3529*    0.2692*    0.3685*    0.3752     0.3312*    +4.67%   5.18%
+Dict+ProxSent  0.3564*    0.2692*    0.3704*    0.3753     0.3345*    +5.22%   5.70%
TREC median     0.3493     0.2705     0.3705     0.3846     0.3010     +0.76%   0.73%

Table 4: Opinion MAP over 5 standard baselines and average improvement for topics 1001-1050. An asterisk (*) indicates a
significant difference (p < 0.05) from the corresponding baseline run according to the Wilcoxon matched-pairs signed-ranks test.

                2006 (851-900)      2007 (901-950)      2008 (1001-1050)    All 150 topics
Run             MAP_rel   MAP_op    MAP_rel   MAP_op    MAP_rel   MAP_op    MAP_rel   MAP_op
uogBLProx       0.3366    0.2224    0.4373    0.3265    0.4141    0.3464    0.3960    0.2984
uogOPb1ofL      0.3307    0.2341*   0.4375    0.3520*   0.4281*   0.3665*   0.3988    0.3175*
uogBLProxCE     0.3459    0.2351    0.4507    0.3393    0.4219    0.3531    0.4062    0.3091
uogOPb2ofL      0.3449    0.2428*   0.4529    0.3584*   0.4342*   0.3709*   0.4106*   0.3240*
baseline 1      0.3004    0.1905    0.4043    0.2758    0.4032    0.3239    0.3693    0.2634
uogOP1PrintL    0.3326*   0.2595*   0.4609*   0.3513*   0.4120*   0.3564*   0.4019*   0.3224*
baseline 2      0.3156    0.2296    0.3881    0.3034    0.3107    0.2639    0.3381    0.2656
uogOP2PrintL    0.3082    0.2410*   0.4069*   0.3415*   0.3045    0.2692*   0.3399*   0.2839*
baseline 3      0.3768    0.2545    0.4619    0.3489    0.4343    0.3564    0.4244    0.3199
uogOP3ofL       0.3769    0.2705*   0.4657    0.3665*   0.4419*   0.3728*   0.4282*   0.3366*
baseline 4      0.4300    0.3022    0.5303    0.3784    0.4724    0.3822    0.4776    0.3543
uogOP4intL      0.4240    0.3134*   0.5428*   0.3959*   0.4750    0.3964*   0.4806    0.3686*
baseline 5      0.4046    0.2632    0.5465    0.3805    0.3745    0.2988    0.4419    0.3141
uogOP5PrintL    0.4138    0.3012*   0.5510    0.4104*   0.3915*   0.3345*   0.4521*   0.3487*

Table 5: Results of our best submitted runs for each baseline on the 2008 topics across 4 sets of topics. An asterisk (*) indicates a
significant difference (p < 0.05) from the corresponding baseline run according to the Wilcoxon matched-pairs signed-ranks test.

Run             MAP_neg   P@10_neg  MAP_pos   P@10_pos
uogBLProx       0.1218    0.1320    0.1376    0.1800
uogPLb1l        0.1225    0.1440    0.0866    0.1260
uogBLProxCE     0.1176    0.1400    0.1388    0.1760
uogPLb2l        0.1176    0.1420    0.1372    0.1700
baseline 1      0.1175    0.1700    0.1364    0.1860
uogPL1l         0.1076    0.1400    0.1272    0.1820
baseline 2      0.0865    0.1420    0.0951    0.1400
uogPL2l         0.0867    0.1420    0.0942    0.1340
baseline 3      0.1266    0.1520    0.1376    0.1680
uogPL3l         0.1203    0.1440    0.1345    0.1800
baseline 4      0.1288    0.1600    0.1532    0.1980
uogPL4l         0.1301    0.1580    0.1394    0.1700
baseline 5      0.1085    0.1680    0.1229    0.1780
uogPL5l         0.1067    0.1700    0.1179    0.1440

Table 6: Results of submitted runs in the polarity task for topics
1001-1050 over 7 different baselines.

Overall, our participation in the TREC 2008 Blog track was very
successful. Our two submitted baseline runs performed well above
the median performance of all participants. In the opinion-finding
task, most of our approaches provided improvements over all the
baselines and the TREC median performance. Moreover, all of
them proved to be robust, as demonstrated by their average
improvement across the standard baselines. Finally, although not
providing a significant improvement over the baselines, our polarity
approach performed well above the median performance of the
participants in this task.

5.  ENTERPRISE  TRACK:  DOCUMENT  SEARCH  TASK

In our participation in the Enterprise track document search task,
we aim to investigate how external resources, such as the Google
and Yahoo! Web search engines, can be used to enhance the
retrieval performance through a collection enrichment approach.
Furthermore, we test how the selective application of collection
enrichment can further improve retrieval effectiveness. Section 5.1
describes the selective application of collection enrichment.
Section 5.2 presents our experiments.


5.1  Selective  Application  of  Collection  Enrichment

For Enterprise document search, query expansion may fail as
Enterprise intranets are often created by a small number of
individuals, which leads to the Enterprise collection having limited use
of alternative lexical representations. In particular, this could lead to
pseudo-relevance sets of poor quality. In this case, it can be
beneficial to use collection enrichment, which expands the initial query
by taking into account a pseudo-relevance set based on larger and
higher-quality external resources [14].

We experiment using five different external resources, namely
Wikipedia, Yahoo! Web, Google Web, Google Scholar, and Google
Books. The Wikipedia website describes how to download the
Wikipedia data2. For the external resources of Yahoo! Web,
Google Web, Google Scholar, and Google Books, we
submit queries to each of these search engines and then download
their returned results. In particular, we split the Yahoo!
Web resource into Yahoo! Web ANY and Yahoo! Web PDF according
to the restriction on the type of the retrieved documents for a
given query. Yahoo! Web ANY means that we do not apply any
restriction on the retrieved documents when we submit a query to
the Yahoo! Web search engine, while Yahoo! Web PDF means that
we restrict the retrieved documents to PDF files only. We make
the same kind of distinction for Google Web and Google Scholar
as well.

We hypothesise that not all queries benefit equally from the
application of collection enrichment. Therefore, we use query
performance predictors to selectively apply collection enrichment on
a per-query basis. Various query performance predictors have been
studied in [13] and shown to be useful and low-cost. In our
experiment, we use two of these predictors, namely, the γ2 and the
Average Inverse Collection Term Frequency (AvICTF) predictors.

2 http://en.wikipedia.org/wiki/Wikipedia:Database_download


negative        baseline1  baseline2  baseline3  baseline4  baseline5  improvement
                                                                       mean     stdev
Baseline        0.1175     0.0865     0.1266     0.1288     0.1085
+DictOF         0.1076     0.0867     0.1203     0.1301     0.1067     -2.76%   3.92%
TREC median     0.0597     0.0457     0.0743     0.0677     0.0453     -48.49%  2.66%

positive        baseline1  baseline2  baseline3  baseline4  baseline5  improvement
                                                                       mean     stdev
Baseline        0.1364     0.0951     0.1376     0.1532     0.1229
+DictOF         0.1272     0.0942     0.1345     0.1394     0.1179     -4.60%   3.29%
TREC median     0.0953     0.0547     0.0955     0.0973     0.0708     -36.79%  17.48%

Table 7: Negative and positive MAP over 5 standard baselines and average improvement for topics 1001-1050.


The definition of γ2 is given as follows:

    \gamma_2 = \frac{idf_{max}}{idf_{min}}    (18)

where idf_max and idf_min are the maximum and minimum idf
among the query terms in the query Q, respectively. The idf of
each query term t is computed as follows:

    idf(t) = \frac{\log_2 \left( (N + 0.5) / N_t \right)}{\log_2 (N + 1)}    (19)

where N_t is the number of documents in which the query term t
appears and N is the number of documents in the whole collection.
The definition of AvICTF is given as follows:

    AvICTF = \frac{\log_2 \prod_{Q} \frac{token_{coll}}{tf_{coll}}}{ql}    (20)

where ql is the query length, tf_coll is the number of occurrences of
a query term in the whole collection and token_coll is the number
of tokens in the whole collection.
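To make the two predictors concrete, the following is a minimal Python sketch of Equations (18) to (20). The function names and the toy collection statistics are illustrative assumptions and are not taken from our experimental setup. The logarithm of the product in Equation (20) is computed as a sum of logarithms, which is mathematically equivalent and avoids numerical overflow for long queries.

```python
import math


def idf(n_docs, doc_freq):
    """Equation (19): normalised idf of a term occurring in doc_freq documents."""
    return math.log2((n_docs + 0.5) / doc_freq) / math.log2(n_docs + 1)


def gamma2(n_docs, doc_freqs):
    """Equation (18): ratio of the largest to the smallest idf among the query terms."""
    idfs = [idf(n_docs, df) for df in doc_freqs]
    return max(idfs) / min(idfs)


def avictf(token_coll, term_freqs):
    """Equation (20): average inverse collection term frequency of a query."""
    # log2 of a product equals the sum of the individual log2 values.
    log_sum = sum(math.log2(token_coll / tf) for tf in term_freqs)
    return log_sum / len(term_freqs)


# Hypothetical two-term query over a toy collection of 1,000 documents
# and 1,000,000 tokens.
print(gamma2(1000, [10, 100]))
print(avictf(1_000_000, [1000, 250]))
```

Both predictors rely only on collection statistics that are available at indexing time, which is why they can be computed before retrieval at a low cost.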

Our decision mechanism is given as follows:

1. Expand the initial query on the local resource if and only
if the prediction score obtained from the local resource is
higher than both a threshold score and the prediction score
obtained from the external resource.

2. Expand the initial query on the external resource if and only
if the prediction score obtained from the external resource
is higher than both a threshold score and the prediction score
obtained from the local resource.

3. Disable the expansion of the initial query if and only if the
prediction scores obtained from the external and local
resources are both lower than a threshold score.

This decision mechanism is summarised in Table 8.
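The three rules above can be sketched as a small Python function. This is a minimal illustration rather than the code used in our runs; the function and argument names are hypothetical. The handling of ties follows the rows of Table 8, where the external resource is chosen whenever the local score is not strictly higher than the external one, and a score equal to the threshold is treated as not exceeding it, which is an assumption.

```python
def choose_expansion(score_local, score_external, threshold=0.0):
    """Return 'local', 'external', or 'disabled' following Table 8.

    The default threshold of 0 mirrors the setting T = 0 reported for
    the submitted runs.
    """
    if score_local > threshold and score_local > score_external:
        return "local"
    if score_external > threshold and not (score_local > score_external):
        return "external"
    return "disabled"


# Hypothetical predictor scores for three queries.
print(choose_expansion(1.2, 0.8))    # local resource predicted to be better
print(choose_expansion(0.3, 0.9))    # external resource predicted to be better
print(choose_expansion(-0.4, -0.1))  # neither resource exceeds the threshold
```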

5.2  Experiments 

We submitted four runs, all of which apply the PL2F DFR
field-based weighting model. More details about this weighting model
can be found in Section 2.1. Three document fields, namely, body,
title, and anchor text of incoming hyperlinks are used. The details
of our submitted runs are given below, while Table 9 summarises
their salient features.


•  Run uogTrEDbl tests how effective the application of query
terms proximity in the DFR framework is by using the pBiL2
randomness model. More details about query terms
proximity in the DFR framework can be found in Section 2.2.

•  Run uogTrEDQE tests how effective the uniform application
of query expansion to all queries is by using the Bo1 term
weighting model. More details about the Bo1 term weighting
model can be found in Section 2.3.

•  As we hypothesise that not all queries benefit equally from
the application of collection enrichment, we propose to use
a query performance predictor to selectively apply collection
enrichment on a per-query basis. Run uogTrEDSelW
investigates how effective the selective application of collection
enrichment is. The external resource in this case is Wikipedia,
and we use γ2 as our predictor, as it performed better than
the AvICTF predictor during our training process.

•  Run uogTrEDSE2 also investigates how effective the
selective application of collection enrichment is. In particular, we
combine the results from several external resources. After
training, we chose to combine the Yahoo! Web ANY and
Yahoo! Web PDF external resources and the AvICTF
predictor.


Run            Techniques
uogTrEDbl      PL2F + proximity
uogTrEDQE      PL2F + query expansion
uogTrEDSelW    PL2F + selective CE on a single resource
uogTrEDSE2     PL2F + selective CE on combined resources

Table 9: Techniques applied in the submitted runs in the
Enterprise track document search task.

Table 10 summarises the results of our official and unofficial
runs on the final 63 judged queries. The table shows that the query
terms proximity technique makes a marginal improvement on the
retrieval performance over the PL2F baseline (ID0 vs. ID1). In
addition, we observe that the runs with the application of query
expansion or collection enrichment outperform the PL2F baseline
(ID0 vs. ID2/ID5). Moreover, the selective application of
collection enrichment makes a marked improvement on retrieval
performance (ID3 vs. ID2/ID5). In particular, the improvement is
significant (p < 0.05) in terms of MAP according to the Wilcoxon
matched-pairs signed-ranks test. We also find that it is very
important to choose an appropriate external resource (ID6 vs. ID7/ID8)
and an appropriate predictor (ID3 vs. ID6) before selectively
applying collection enrichment. Finally, we notice that the combination
of external resources makes a slight improvement over any single
external resource (ID4 vs. ID7/ID8).

Overall, in the Enterprise track document search task, we have
shown that using query performance predictors to selectively apply
collection enrichment on a per-query basis can enhance the retrieval
performance.

6.  RELEVANCE  FEEDBACK  TRACK 

In the first Relevance Feedback track, we aim to test the
effectiveness of the DFR query expansion framework in various
relevance feedback settings. In addition, we apply a novel technique
that expands queries on surrogates of the feedback documents,
instead of the raw documents themselves.


score_L > T     score_E > T     score_L > score_E   Decision
True            True or False   True                local
True or False   True            False               external
False           False           True or False       disabled

Table 8: The decision mechanism of the selective application of collection enrichment. score_L and score_E denote the prediction
scores on the local and external resources, respectively. T is a threshold score, which needs an appropriate setting. We used T = 0 in
the submitted runs. In the Decision column, local, external and disabled indicate expanding the initial query on the local resource,
expanding it on the external resource, and disabling the expansion, respectively.

Technique
PL2F
+ Proximity
+ Query Expansion
+ Selective CE (wiki)
+ Selective CE (YWA + YWP)
PL2F
+ CE (wiki)
+ Selective CE (wiki)
+ Selective CE (YWA)
+ Selective CE (YWP)

Table 10: The results of our official and unofficial runs in the Enterprise track document search task. The highest value in each
column is highlighted in bold.


6.1  Document  Surrogates  Creation 

We expand the query from document surrogates, instead of all
text content in the documents. The document surrogates are created
by a low-cost, syntactically-based information processing model,
which uses surface-syntactic evidence in order to automatically
identify informative content and to reduce the noise from any textual
input. We use the surface-syntactic approach to prune the feedback
documents before selecting the query expansion terms, allowing
(noisy) terms carried in unusual syntactic structures to be ignored.

We use part-of-speech (POS) n-grams [4, 16] to detect noise
in the indexed documents. POS n-grams are n-grams of
parts-of-speech, which are extracted from a POS-tagged sentence in a
recurrent and overlapping way. For example, for a sentence ABCDEFG,
where parts-of-speech are denoted by the single letters A, B, C, D,
E, F, G, and where POS n-grams have length n = 4, the POS
n-grams extracted are ABCD, BCDE, CDEF, and DEFG. The order
in which the POS n-grams occur in the sentence is ignored. For
each sentence, all possible POS n-grams are extracted.
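The extraction of overlapping POS n-grams can be illustrated with a few lines of Python. This is only a sketch of the sliding-window idea described above; the function name pos_ngrams is illustrative, and a real input would be the tag sequence produced by a POS tagger rather than single letters.

```python
def pos_ngrams(tags, n=4):
    """Return all overlapping n-grams of a POS-tag sequence, left to right."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Each letter stands for one part-of-speech tag, as in the example above.
sentence_tags = list("ABCDEFG")
grams = ["".join(gram) for gram in pos_ngrams(sentence_tags, n=4)]
print(grams)  # ['ABCD', 'BCDE', 'CDEF', 'DEFG']
```

A sentence with fewer than n tags yields no POS n-grams under this sketch.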

Our technique is based on the fact that high-frequency POS
n-grams correspond mostly to sequences of words that include
relatively little noise, whereas low-frequency POS n-grams correspond
mostly to sequences of words that include relatively more noise [16].
The only resources needed are a POS tagger and a collection of
documents. This can be any collection of documents of a reasonable
size, not necessarily the target collection, i.e., the collection from
which we retrieve documents [15].

Our methodology is as follows. We extract POS n-grams from
a collection of documents and count their frequency. We refer to
these POS n-grams as global POS n-grams. We rank these global
POS n-grams according to their frequency in the collection (in
decreasing order). We refer to this ranked list as the global list. We
empirically set a cutoff threshold θ on the POS n-gram rank in the
global list and we assume that everything below this threshold
corresponds to estimated noise (Figure 1). We then extract POS n-grams
from the text we wish to process. For each POS n-gram drawn from the
text, we determine its rank in the global list. Whenever this rank
is below the threshold, we remove the POS n-gram and its
corresponding sequence of words from the document, regardless of any
other POS n-grams that overlap it.
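The pruning procedure above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not our actual implementation: sentences are represented as lists of (word, tag) pairs, the helper names are hypothetical, a rank "below the threshold" is read as a rank number larger than theta, and POS n-grams that never occur in the reference collection are treated as noise.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """All overlapping n-grams of a POS-tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def build_global_list(tagged_sentences, n):
    """Map each POS n-gram of a reference collection to its frequency rank.

    Rank 1 is the most frequent POS n-gram.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(pos_ngrams([tag for _, tag in sentence], n))
    ranked = [gram for gram, _ in counts.most_common()]
    return {gram: rank for rank, gram in enumerate(ranked, start=1)}


def prune_sentence(sentence, global_rank, theta, n):
    """Drop every word covered by a POS n-gram ranked below the cutoff theta."""
    tags = [tag for _, tag in sentence]
    noisy = set()
    for start, gram in enumerate(pos_ngrams(tags, n)):
        # Assumption: n-grams absent from the global list count as noise.
        rank = global_rank.get(gram, theta + 1)
        if rank > theta:
            # Remove the whole word sequence, even if other n-grams overlap it.
            noisy.update(range(start, start + n))
    return [word for i, (word, _) in enumerate(sentence) if i not in noisy]


# Toy reference collection and text, with n = 2 to keep the example short.
reference = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "DT"), ("dog", "NN"), ("ran", "VB")],
    [("wow", "UH"), ("the", "DT")],
]
ranks = build_global_list(reference, n=2)
text = [("hey", "UH"), ("the", "DT"), ("bird", "NN"), ("flew", "VB")]
print(prune_sentence(text, ranks, theta=2, n=2))
```

In this toy example, the rare (UH, DT) pattern falls below the cutoff, so both of its words are removed, including "the", even though "the" also takes part in the frequent (DT, NN) pattern.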


Figure 1: POS n-grams ranked by frequency, from the most frequent
POS n-grams to the least frequent POS n-grams.

We remove POS n-grams in a uniform way, i.e., by setting θ
to the same value for all documents. We use the POS n-grams
extracted from WT10G to reduce noise from the index of .GOV2.
In particular, we use TreeTagger3 for the POS tagging of WT10G.
The POS n-grams extracted from WT10G provide us with a global
list of POS n-grams. Overall, we extract 25,070 POS n-grams from
WT10G with θ = 17,070.

6.2  Experiments 

In our participation in the Relevance Feedback track, we apply
the DPH model (see Equation (6)) for document ranking, and the
KL model (Equations (12) & (13)) for query expansion.
Moreover, we take into account the proximity between the original query
terms by applying the pBiL randomness model [17] as given by
Equation (9), except that Normalisation 2 is not applied. We use
these three parameter-free models for experimentation so that we
focus on studying the query expansion aspect.

As the applied weighting models and the query expansion model
are parameter-free, the only parameters that we need to fix are the
number of expansion terms (exp_term) and the number of top-ranked
documents (exp_doc) from which they are extracted. For training
purposes, we use the 65 odd-numbered Terabyte track ad-hoc topics
for which there are at least 4 relevant documents in the top 50
documents returned by DPH. We scan a wide range of possible values
of exp_doc and exp_term, namely every exp_doc value within
2 < exp_doc < 10, and every exp_term value within
10 < exp_term < 100 with an interval of 5, in order to maximise
MAP. We obtain exp_doc = 3 and exp_term = 35, which are used in
our submitted runs.

3 Details on the parameters and tagset used can be found in [16].
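The parameter scan described above amounts to an exhaustive grid search. The Python sketch below illustrates the idea under two assumptions: the bounds are read as strict inequalities, exactly as printed, and evaluate_map is a hypothetical callable that runs query expansion on the training topics and returns the resulting MAP. A toy scoring function stands in for it here so that the example is self-contained.

```python
from itertools import product


def grid_search(evaluate_map, doc_values, term_values):
    """Return the (exp_doc, exp_term) pair with the highest training MAP."""
    return max(product(doc_values, term_values),
               key=lambda pair: evaluate_map(*pair))


# Candidate values under a strict reading of the bounds:
# 2 < exp_doc < 10, and 10 < exp_term < 100 in steps of 5.
doc_values = range(3, 10)
term_values = range(15, 100, 5)


def toy_map(exp_doc, exp_term):
    """Stand-in for a real evaluation; peaks at exp_doc = 3, exp_term = 35."""
    return 0.30 - 0.01 * (exp_doc - 3) ** 2 - 0.0001 * (exp_term - 35) ** 2


best_doc, best_term = grid_search(toy_map, doc_values, term_values)
print(best_doc, best_term)
```

With 7 values of exp_doc and 17 values of exp_term, this reading of the bounds gives 119 settings to evaluate, which is inexpensive because the underlying models have no further parameters to tune.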

We submitted two lists of runs for each set of feedback
documents as follows. The first list, namely runs uogRF08.{A-E}1,
applies query expansion on surrogates of the positive documents
for the feedback sets B-E, and on surrogates of the top-3 returned
documents for the feedback set A. The second list, namely runs
uogRF08.{A-E}2, applies query expansion on positive documents
for sets B-E, and on the top-3 returned documents for set A.

Table 11 provides the retrieval performance of our submitted
runs as given by three different evaluation measures: Top 10 AP,
MCT AP, and Stat AP. From Table 11, we conclude the following:

1. It is beneficial to use the positive feedback documents for
query expansion. The retrieval performance for sets B-E is
markedly higher than that of set A.

2. Query expansion on document surrogates has a better
retrieval performance in terms of Top 10 AP than query
expansion on the raw documents.

Run            Top 10 AP   MCT AP    Stat AP
Positive QE on Surrogates
uogRF08.A1     0.1971      0.0560    0.2919
uogRF08.B1     0.2202      0.0641    0.3075
uogRF08.C1     0.2274      0.0673    0.3205
uogRF08.D1     0.2323      0.0673    0.3379
uogRF08.E1     0.2272      0.0661    0.3420
Positive QE
uogRF08.A2     0.1921      0.0563    0.2843
uogRF08.B2     0.2088      0.0658    0.3092
uogRF08.C2     0.2118      0.0697    0.3222
uogRF08.D2     0.2219      0.0695    0.3393
uogRF08.E2     0.2234      0.0682    0.3373
TREC median    0.1427      0.0564    0.1946

Table 11: Results of submitted runs in the Relevance Feedback
track.

We have also conducted additional experiments to test the
effectiveness of the techniques applied in our participation using the
31 topics for which the top-10 returned documents were judged by
assessors. We obtain the following results:

•  First, we evaluate the usefulness of pseudo-query expansion
compared with the first-pass retrieval. We obtain MAP 0.1368
for first-pass retrieval, and MAP 0.1982 for pseudo-query
expansion using the top-3 returned documents for relevance
feedback. Pseudo-query expansion provides a 44.88%
statistically significant improvement over the first-pass retrieval,
which attests to the effectiveness of this technique.

•  Second, we test if the use of positive feedback documents
(Pos QE) provides a better retrieval performance than
pseudo-query expansion (Pseudo-QE). From Table 12, we can see
that positive feedback documents do bring useful
information, particularly for the settings which have a relatively larger
number of feedback documents (see results for sets D and E).

•  Third, we evaluate if query expansion on document
surrogates improves the retrieval performance. Table 13 provides
the related results. In general, we find no obvious difference
between query expansion on the whole documents and query
expansion on the document surrogates on the 31 topics used.
No statistically significant difference is found between their
corresponding MAP values.


Set   Pseudo-QE   Pos QE    diff. (%)
B     0.1982      0.2153    8.63
C     0.1982      0.2240    13.02
D     0.1982      0.2330    17.56*
E     0.1982      0.2343    18.21*

Table 12: The MAP values obtained by pseudo query expansion
(Pseudo-QE), and by positive query expansion (Pos QE). A
statistically significant difference between the two MAP values at the
0.05 level is marked with a star.

Set   QE/Pos QE   Sur QE    diff. (%)
A     0.1982      0.2001    -
B     0.2153      0.2187    1.58
C     0.2240      0.2275    1.56
D     0.2330      0.2376    1.97
E     0.2343      0.2272    -

Table 13: The MAP values obtained by pseudo/positive query
expansion (QE/Pos QE), and by query expansion on document
surrogates (Sur QE).
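The percentage differences reported in this section appear to be simple relative changes in MAP. The short Python sketch below, whose function name is illustrative, checks this reading against figures quoted above: the 44.88% gain of pseudo-query expansion over first-pass retrieval, and the set B rows of Tables 12 and 13. Applied to the set A and set E rows of Table 13, the same formula would give roughly 0.96 and -3.03; these two figures are computed here and are not taken from the table.

```python
def relative_diff(reference, value):
    """Percentage change of value with respect to reference."""
    return 100.0 * (value - reference) / reference


# First-pass retrieval vs. pseudo-query expansion (MAP 0.1368 vs. 0.1982).
first_pass_gain = relative_diff(0.1368, 0.1982)

# Set B: pseudo-QE vs. Pos QE (Table 12) and Pos QE vs. Sur QE (Table 13).
set_b_pos = relative_diff(0.1982, 0.2153)
set_b_sur = relative_diff(0.2153, 0.2187)

for label, diff in [("first pass -> pseudo-QE", first_pass_gain),
                    ("set B, Table 12", set_b_pos),
                    ("set B, Table 13", set_b_sur)]:
    print(f"{label}: {diff:.2f}%")
```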

Overall, we have shown that expanding queries on positive
documents is markedly better than pseudo-relevance feedback. We have
also shown that it is beneficial to expand the queries over a refined
representation of the feedback documents, namely, the document
surrogates. In addition, we have investigated the usefulness of
negative feedback documents for query expansion based on an
adaptation of Rocchio’s relevance feedback algorithm [33] to the DFR
query expansion framework. We have not found the negative
documents to be useful for relevance feedback, in line with the findings
of other participants.

7.  ENTERPRISE  TRACK:  EXPERT  SEARCH  TASK

We participate in the expert search task of the TREC 2008
Enterprise track with the aim of continuing to test and develop our novel
Voting Model [22]. In the expert search task, systems are asked
to rank candidate experts with respect to their predicted expertise
about a query, using documentary evidence of expertise found in
the collection.

In our participation in the expert search task this year, we follow
our central themes of proximity and enrichment. In particular, we
use an advanced proximity extension to the Voting Model, which
uses an information-theoretic DFR model to calculate the
informativeness of a candidate’s name occurring in close proximity to the
terms of the query. Moreover, we enrich the profiles of the
candidate experts to obtain better evidence of their expertise.

7.1  Voting  Model 

In expert search, the expertise areas of the candidates are
represented to the system by documentary evidence of expertise, known
as candidate profiles. In our Voting Model for expert search, instead
of directly ranking candidates using these profiles, we consider the
ranking of documents with respect to the query Q, which we denote
R(Q). We propose that the ranking of candidates can be modelled
as a voting process from the retrieved documents in R(Q) to the
profiles of candidates: every time a document is retrieved and is
associated with a candidate, this is a vote for that candidate
to have relevant expertise to Q. The votes for each candidate are
then appropriately aggregated to form a ranking of candidates,
taking into account the number of voting documents for that candidate,
and the relevance score of the voting documents. Our Voting Model
is extensible and general, and is not collection- or topic-dependent.

In [22], we defined twelve voting techniques for aggregating
votes for candidates, adapted from existing data fusion techniques.
In this work, we apply only the robust and effective expCombMNZ
voting technique for ranking candidates. expCombMNZ ranks
candidates by considering the sum of the exponential of the relevance
scores of the documents associated with each candidate’s profile.
Moreover, it includes a component which takes into account the
number of documents in R(Q) associated with each candidate, hence
explicitly modelling the number of votes made by the documents
for each candidate. In expCombMNZ, the score of a candidate C’s
expertise to a query Q is given by:

\[ \mathrm{score}(C, Q) = |R(Q) \cap \mathrm{profile}(C)| \cdot \sum_{d \in R(Q) \cap \mathrm{profile}(C)} \exp(\mathrm{score}(d, Q)) \qquad (21) \]

where |R(Q) ∩ profile(C)| is the number of documents from the
profile of candidate C that are in the ranking R(Q).
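
As an illustration, expCombMNZ can be sketched in a few lines of Python. This is a minimal sketch rather than the Terrier implementation: the function name, the dictionary-based inputs, and the choice to omit candidates that receive no votes are assumptions made for the example.

```python
import math


def exp_comb_mnz(doc_scores, profiles):
    """Rank candidates in the style of expCombMNZ (Equation (21)).

    doc_scores: dict mapping doc id -> score(d, Q) for the documents in R(Q).
    profiles:   dict mapping candidate -> set of doc ids in profile(C).
    Both input structures are hypothetical, chosen for illustration.
    """
    scores = {}
    for candidate, profile in profiles.items():
        # Voting documents: R(Q) intersected with profile(C).
        voters = [d for d in profile if d in doc_scores]
        if not voters:
            continue  # assumption: candidates without votes are not ranked
        vote_strength = sum(math.exp(doc_scores[d]) for d in voters)
        scores[candidate] = len(voters) * vote_strength
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = exp_comb_mnz(
    {"d1": 1.0, "d2": 0.0},
    {"a": {"d1", "d2"}, "b": {"d2"}, "c": {"d9"}},
)
```

Candidate "a" receives two votes and is ranked above "b", which receives one; "c" has no retrieved documents and is left out.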

Some  types  of  documents  can  have  many  topic  areas  and  many 
occurrences  of  candidate  names  (e.g.,  the  minutes  of  a  meeting). 
In  such  documents,  the  closer  a  candidate’s  name  occurrence  is  to 
the  query  terms,  the  more  likely  that  the  document  is  a  high  quality 
indicator  of  expertise  for  that  candidate  [6,  29].  To  this  end,  we 
define  a  voting  technique,  based  on  expCombMNZ,  which  takes 
into  account  the  proximity  of  each  candidate’s  name  occurrence  to 
the  query  terms  in  the  documents  [19].  We  measure  this  proximity 
using  the  DFR  term  proximity  model  defined  in  Section  2.2.  This 
model  is  designed  to  measure  the  informativeness  of  a  pair  of  query 
terms  occurring  in  close  proximity  in  a  document.  We  adapt  this  to 
the  expert  search  task  and  into  the  expCombMNZ  voting  technique 
(Equation  (21)),  by  measuring  the  informativeness  of  a  query  term 
occurring  in  close  proximity  to  a  candidate’s  name.  The  adapted 
voting  technique,  expCombMNZProx,  is  given  as  follows: 

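\[ \mathrm{score}(C, Q) = |R(Q) \cap \mathrm{profile}(C)| \cdot \sum_{d \in R(Q) \cap \mathrm{profile}(C)} \exp\Big( \mathrm{score}(d, Q) + \sum_{p = \mathrm{name}(C) \times t \in Q} \mathrm{score}(d, p) \Big) \qquad (22) \]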

Here, p = (t, C) is a tuple of a term t from the query and the full
name of candidate C. score(d, p) can be calculated using any DFR
weighting model [17]; for efficiency reasons, we use the pBiL2
model (Equation (9)), as it does not consider the frequency of the
tuple p in the collection but only in the document.
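
The following Python sketch shows how the proximity evidence enters the vote. It assumes that the per-tuple proximity scores score(d, p) have already been computed (for instance by pBiL2, which is not reimplemented here); the function name and the input layout are hypothetical.

```python
import math


def exp_comb_mnz_prox(doc_scores, profiles, prox_scores):
    """Sketch of expCombMNZProx (Equation (22)).

    doc_scores:  dict doc id -> score(d, Q) for documents in R(Q).
    profiles:    dict candidate -> set of doc ids in profile(C).
    prox_scores: dict (doc id, candidate) -> dict query term -> score(d, p),
                 where p = (t, C). Assumed to be precomputed elsewhere.
    """
    scores = {}
    for candidate, profile in profiles.items():
        voters = [d for d in profile if d in doc_scores]
        if not voters:
            continue
        total = 0.0
        for d in voters:
            # Sum of score(d, p) over all query terms t paired with name(C).
            # A missing entry means no proximity evidence; the vote is kept.
            proximity = sum(prox_scores.get((d, candidate), {}).values())
            total += math.exp(doc_scores[d] + proximity)
        scores[candidate] = len(voters) * total
    return scores


prox_demo = exp_comb_mnz_prox(
    {"d1": 0.0},
    {"a": {"d1"}, "b": {"d1"}},
    {("d1", "a"): {"gene": 1.0}},
)
```

Both candidates are associated with the same document, but only "a" has its name near a query term, so its vote is stronger. Candidate "b" still receives its ordinary vote.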

Hence,  in  this  way,  we  are  able  to  use  the  same  weighting  model 
to  count  and  weight  candidate  occurrences  in  close  proximity  to 
query  terms  as  that  used  in  other  TREC  tracks  (Blog,  Relevance 
Feedback)  to  weight  query  term  occurrences  in  close  proximity. 
Note  that  this  approach  does  not  remove  evidence  of  expertise  for  a 
candidate  where  the  candidate’s  name  does  not  occur  near  a  query 
term,  as  this  may  result  in  a  relevant  candidate  not  being  retrieved 
for a difficult query (i.e., the relevant candidate had only sparse evidence
of expertise). Instead, candidates with names occurring in
close proximity to query terms are given stronger votes in the Voting
Model, and hence should be ranked higher in the final ranking.

Some voting techniques in the Voting Model can suffer from being
biased towards candidates with large candidate profiles (many
associated documents). To neutralise this effect, we apply a normalisation
function called Norm2D [21]:

\[ \mathrm{score}(C, Q) = \mathrm{score}(C, Q) \cdot \log\left(1 + c_{pro} \cdot \frac{avg\_l_{pro}}{|\mathrm{profile}(C)|}\right) \qquad (23) \]

where c_pro is a free parameter (c_pro > 0), |profile(C)| is the
size of the profile of candidate C, measured as the number of documents,
and avg_l_pro is the average size of the profiles of all candidates.
This is inspired by Normalisation 2 from the DFR framework
(Equation (3)).
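
A minimal Python sketch of this normalisation follows. It assumes the Normalisation 2 form, in which the score is multiplied by log(1 + c_pro · avg_l_pro / |profile(C)|); the base-2 logarithm, the function name, and the default parameter value are assumptions made for illustration.

```python
import math


def norm2d(score, profile_size, avg_profile_size, c_pro=1.0):
    """Scale a candidate score by profile size, in the style of Norm2D.

    profile_size:     number of documents in profile(C).
    avg_profile_size: average profile size over all candidates.
    c_pro:            free parameter, must be > 0 (default is illustrative).
    The base-2 logarithm is an assumption borrowed from Normalisation 2.
    """
    if profile_size <= 0 or c_pro <= 0:
        raise ValueError("profile_size and c_pro must be positive")
    return score * math.log2(1.0 + c_pro * avg_profile_size / profile_size)


average_sized = norm2d(10.0, profile_size=5, avg_profile_size=5)
large_profile = norm2d(10.0, profile_size=50, avg_profile_size=5)
```

With c_pro = 1, a candidate with an average-sized profile keeps its score, while a candidate whose profile is ten times larger is scaled down.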

7.2  Enriching  Candidate  Profiles 

In  keeping  with  our  TREC  theme  this  year,  we  investigate  how 
enterprise  data  can  be  enriched  by  an  external  source  of  evidence. 

In [30], Serdyukov & Hiemstra proposed the use of external evidence
in expert search, in particular, by using queries submitted
to commercial Web search engines. In this work, we follow their
suggestion for identifying useful external evidence. However, we
develop more refined methods for ranking the experts. In particular,
we actually download and rank all of the expertise evidence
derived from a given source.

In order to identify expertise evidence for the candidates on the
Web, we build new queries, which we call "evidence identification
queries". These evidence identification queries involve both the actual
expert search query (from the TREC 2007 Expert search task),
and the name of a candidate. We then submit these evidence identification
queries to the APIs of a major Web search engine, which
allows Web documents specific to the query and to the candidate
to be retrieved. In particular, each query contains:

•  the  quoted  full  name  of  the  person:  e.g.,  “craig  macdonald”, 

•  the  name  of  the  organisation:  e.g.,  csiro, 

•  query terms without any quotations: e.g., genetic modification,

•  a directive prohibiting any results from the actual organisation
Web site: -site:csiro.au.
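
The four components above can be assembled as in the following sketch. The function name and the ordering of the components within the query string are assumptions; the source specifies only which components each query contains.

```python
def evidence_identification_query(full_name, organisation, query_terms, org_site):
    """Build one evidence identification query string.

    full_name:    candidate's full name (quoted in the query).
    organisation: organisation name, used for name disambiguation.
    query_terms:  the expert search query, left unquoted.
    org_site:     the organisation's own Web site, to be excluded.
    The component order is an assumption made for this illustration.
    """
    return f'"{full_name}" {organisation} {query_terms} -site:{org_site}'


example_query = evidence_identification_query(
    "craig macdonald", "csiro", "genetic modification", "csiro.au"
)
```

For the example values used in the list above, this gives the string `"craig macdonald" csiro genetic modification -site:csiro.au`.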

The use of the name of the organisation helps in name disambiguation,
to prevent the matching of any content not related to the
candidate expert in question. However, this will also prevent the
matching of evidence for a candidate from a previous employer.
The  prohibitive  -site  directive,  in  turn,  ensures  that  the  acquired 
expertise  evidence  does  not  overlap  with  the  intranet  collection. 

For each of the 50 topics in the TREC 2008 expert search task,
we submit the evidence identification queries to the Yahoo! Web
search engine, for the top 100 candidates suggested by our baseline
expert search engine. From the search listing results, we extract a
list of URLs associated with each candidate. For each evidence identification
query, a maximum of 24 results are extracted, and the corresponding
Web pages downloaded. These pages form the profiles
of the candidates. Note that these new profiles are query-biased, as
only documents which are related to the query topic(s) are associated
with each candidate.

To  create  the  runs,  we  rank  all  downloaded  documents  which 
have  been  returned  by  the  Yahoo!  Web  search  engine  using  the 
PL2  document  weighting  model  (Equations  (1)  &  (3)).  Then,  a 
voting  technique  is  applied  to  convert  this  ranking  of  documents 
into  a  ranking  of  candidates  using  the  query-biased  profiles. 


7.3  Experimental Setup, Runs, and Results

We  submitted  four  runs  to  the  expert  search  task  of  the  Enterprise 
track.  Along  with  the  unsubmitted  baselines,  these  are: 

•  uogTrEXFeMNZ is our baseline run (unsubmitted). It applies
the PL2F DFR document weighting model (see Equations
(1) & (2)) to generate the underlying ranking of documents
from the CERC collection, combined with the
expCombMNZ (Equation (21)) voting technique to rank experts.

•  uogTrEXFeMNZP is a baseline run (unsubmitted), which
improves upon the baseline run by applying query-term proximity
(Equation (9)) before expCombMNZ.

•  uogTrEXfeNP improves upon uogTrEXFeMNZP by applying
candidate size normalisation on top of expCombMNZ.

•  uogTrEXfePC improves upon the baseline run by applying
candidate-query term proximity, expCombMNZProx (Equation
(22)), instead of expCombMNZ.

•  uogTrEXfeNPC combines the previous two runs, by applying
expCombMNZProx and candidate size normalisation.

•  uogTrEXeY  is  a  baseline  run  (unsubmitted),  which  applies 
the  PL2  DFR  document  weighting  model  on  the  externally 
obtained  Yahoo!  document  index. 

•  uogTrEXmix  combines  uogTrEXfeNPC  and  uogTrEXeY  by 
a  linear  mixture  combination. 

The salient features of the runs are described in Table 14. Note
that, for all runs using the CERC collection, we use the candidate
profiles identified during our TREC 2007 participation [9].

Table 15 presents the retrieval performance of the runs described
above, as well as the per-topic median and best runs from all TREC
participants. From the results, we draw the following observations:
all of our submitted runs were above median; our best performing
run was uogTrEXfeNPC, which applied expCombMNZProx
and normalisation; applying query-term proximity did not appear
to benefit retrieval performance for MAP, but did improve MRR
(uogEXFeMNZ vs. uogEXFeMNZP); candidate normalisation improved
retrieval performance (uogEXFeMNZP vs. uogTrEXfeNP
and uogTrEXfePC vs. uogTrEXfeNPC); candidate-query term proximity
was more effective than query-term proximity or the baseline
(uogTrEXfePC vs. uogEXFeMNZ and uogEXFeMNZP), while applying
normalisation on top was more beneficial (uogTrEXfePC vs.
uogTrEXfeNPC); lastly, the usefulness of Yahoo! for expertise evidence
mining was disappointing (uogTrEXeY), and hindered retrieval
performance when combined with uogTrEXfeNPC.


Run  Name 

Submitted 

MAP 

MRR 

P@10 

TREC  best 

- 

0.6844 

0.9909 

- 

TREC  median 

- 

- 

uogEXFeMNZ 

0.3484 

0.6550 

0.3130 

uogEXFeMNZP 

0.3148 

uogTrEXfeNP 

0.3218 

uogTrEXfePC 

uogTrEXfeNPC 

uogTrEXeY 

0.2428 

uogTrEXmix 

0.3748 

Table 15: Retrieval performance of our Enterprise track
expert search task runs, and also of the TREC per-topic best and
median runs.

Overall, we conclude that we successfully participated in the
TREC 2008 expert search task of the Enterprise track. All of our
submitted runs were above median, and our normalisation and candidate
query-term proximity features were successful at increasing
baseline retrieval performance. The low performance of the
Web-enriched candidate profiles requires further investigation.

8.  BLOG TRACK: BLOG DISTILLATION TASK

In TREC 2008, we also participate in the blog distillation task of
the Blog track, where we aim to test the applicability of our Voting
Model [22] to this task. In the blog distillation task, the
aim of each system is to identify the blogs that have a principal
recurring interest in the query topic [24]. We believe that this task
can be seen as a voting process: a blogger with an interest in a
topic will blog regularly about the topic, and these blog posts will
be retrieved in response to a query topic. Each time a blog post
is retrieved about a query topic, that can be seen as a vote for that
blog to have an interest in the topic area. In [9] and [20], we showed
that the task can be successfully modelled using the Voting Model.
With this in mind, many of the techniques we apply in this task are
described in Section 7 above: for each candidate expert C, read
blog; for each document d, read blog post. The set of posts of each
blog forms the blog's "candidate profile".

We  also  investigate  the  use  of  a  feature  which  ascertains  if  the 
retrieved  posts  in  a  given  blog  for  a  topic  are  spread  across  the  time 
span  of  the  collection.  If  a  blogger  has  an  interest  in  a  topic  area, 
it  is  likely  that  he  or  she  will  continue  to  blog  about  the  topic  area 
repeatedly  and  frequently.  Indeed,  the  definition  for  a  relevant  blog 
in  the  blog  distillation  task  gives  a  clue  that  the  timing  of  on-topic 
posts  by  a  blog  may  have  an  impact  on  the  overall  relevance  of  the 
blog.  In  particular,  we  believe  that  a  relevant  blog  will  continue  to 
post  relevant  posts  throughout  the  timescale  of  the  collection. 

With this in mind, we break the 11-week period of the Blogs06
collection into a series of DI equal intervals (where DI is a parameter).
Then, for each blog, we measure the proportion of its
posts in each time interval that were retrieved in response to a
query. We call this evidence recurring interests (Dates), and define
QscoreDates(C, Q) for each blog C as follows [20]:

\[ \mathrm{QscoreDates}(C, Q) = \prod_{i=1}^{DI} \frac{1 + |R(Q) \cap \mathrm{dateInterval}_i(\mathrm{posts}(C))|}{1 + |\mathrm{dateInterval}_i(\mathrm{posts}(C))|} \qquad (24) \]

where |dateInterval_i(posts(C))| is the number of posts of blog
C in the i-th date interval. Note that we smooth this probability
distribution using Laplace smoothing to combat sparsity problems
(e.g., when a blog had no posts in a date interval). We integrate the
QscoreDates(C, Q) evidence as:

\[ \mathrm{score}(C, Q) = \mathrm{score}(C, Q) \times \mathrm{QscoreDates}(C, Q)^{\omega} \qquad (25) \]

where ω > 0 is a free parameter. We use DI = 3, which approximates
the month in which the post was made (the corpus time
span is 11 weeks), and ω = 0.48. Initial experiments found that
using higher values for DI does not change the results, due to the
time span of the corpus. Finally, note that, as this evidence requires
knowledge of the ranking of posts for a query, it has to be calculated
during the retrieval phase, but without adding high overheads.
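
The recurring-interests evidence can be illustrated with the Python sketch below. It assumes that the Laplace-smoothed per-interval ratios are combined by a product over the DI intervals; that combination, the function names, and the list-based inputs are assumptions made for the example.

```python
def qscore_dates(retrieved_per_interval, posts_per_interval):
    """Recurring-interests evidence for one blog and one query.

    retrieved_per_interval[i]: posts of the blog in interval i that are in R(Q).
    posts_per_interval[i]:     all posts of the blog in interval i.
    Assumption: the smoothed ratios are multiplied across intervals.
    """
    if len(retrieved_per_interval) != len(posts_per_interval):
        raise ValueError("both lists must cover the same DI intervals")
    q = 1.0
    for retrieved, posts in zip(retrieved_per_interval, posts_per_interval):
        # Laplace smoothing keeps the ratio defined for empty intervals.
        q *= (1 + retrieved) / (1 + posts)
    return q


def apply_dates(score, q, omega=0.48):
    """Integrate the evidence as in Equation (25): score * q ** omega."""
    return score * q ** omega


spread = qscore_dates([1, 1, 1], [3, 3, 3])        # on-topic posts every month
concentrated = qscore_dates([3, 0, 0], [3, 3, 3])  # on-topic posts in one month
```

Both blogs have three retrieved posts out of nine, but under the product assumption the blog whose on-topic posts are spread over all three intervals obtains the higher evidence value (0.125 against 0.0625).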

We  submitted  4  runs  to  the  blog  distillation  task  of  the  TREC 
2008  Blog  Track,  which  test  our  hypotheses  for  this  task.  Below, 
we  describe  our  submitted  and  unsubmitted  runs: 

•  uogTrBDfe  is  our  baseline  run.  It  uses  the  PL2F  weighting 
model  together  with  the  expCombMNZ  voting  technique  to 
score  the  predicted  relevance  of  blogs  to  the  query  topic. 


Run Name        Submitted   Source    Salient Features
uogEXFeMNZ      ✗           CERC      PL2F + expCombMNZ
uogEXFeMNZP     ✗           CERC      PL2F + Proximity + expCombMNZ
uogTrEXfeNP     ✓           CERC      + Norm2D
uogTrEXfePC     ✓           CERC      PL2F + expCombMNZProx
uogTrEXfeNPC    ✓           CERC      + Norm2D
uogTrEXeY       ✗           Yahoo!    PL2 + expCombMNZ
uogTrEXmix      ✓           (both)    mixture of uogTrEXeY & uogTrEXfeNPC

Table 14: Salient features of our Enterprise track expert search task runs.


•  uogTrBDfeN  is  an  unsubmitted  baseline  run.  In  addition  to 
PL2F  and  expCombMNZ,  it  applies  Norm2D  to  remove  any 
bias  towards  prolific  bloggers. 

•  uogTrBDfeNP improves on the baseline run in two ways:
firstly, by boosting the rank of documents in the document
ranking where the query terms occur in close proximity, using
the pBiL2 DFR term dependence model (Equation (9)); secondly,
by using the Norm2D normalisation technique (Equation
(23)), which has been shown to be useful on this task [20].

•  uogTrBDfeNPD investigates the use of the recurring interests
quality evidence, compared to uogTrBDfeNP.

•  uogTrBDfeNWD investigates the use of Wikipedia for collection
enrichment, compared to uogTrBDfeNP, similarly to
our collection enrichment approach in the document search
task (Section 5). In particular, to enrich the topics, we apply
the Bo1 term weighting model (Equation (10)) on the full
content of a Wikipedia index from 2008, and using a
pseudo-relevance set of size exp_doc = 30 documents, we expand
each query with exp_term = 10 additional terms.
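
To make the expansion step concrete, the sketch below scores candidate expansion terms with the commonly used form of the Bo1 model, w(t) = tf_x · log2((1 + P_n) / P_n) + log2(1 + P_n) with P_n = F / N, and appends the best new terms to the query. Equation (10) is not restated here, so this form, the function names, and the toy statistics are assumptions; term reweighting of the expanded query is omitted.

```python
import math


def bo1_weight(tf_x, coll_freq, num_docs):
    """Bo1 weight of a term (assumed standard form).

    tf_x:      frequency of the term in the pseudo-relevance set.
    coll_freq: frequency of the term in the whole (external) collection.
    num_docs:  number of documents in that collection.
    """
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)


def expand_query(query_terms, feedback_tf, coll_freq, num_docs, exp_term=10):
    """Append the exp_term highest-weighted new terms to the query."""
    weights = {
        term: bo1_weight(tf, coll_freq[term], num_docs)
        for term, tf in feedback_tf.items()
        if coll_freq.get(term, 0) > 0
    }
    ranked = sorted(weights, key=weights.get, reverse=True)
    new_terms = [t for t in ranked if t not in query_terms][:exp_term]
    return list(query_terms) + new_terms


# Toy statistics, invented purely for illustration.
expanded = expand_query(
    ["gm"],
    feedback_tf={"gm": 5, "crop": 3, "the": 3},
    coll_freq={"gm": 10, "crop": 20, "the": 100000},
    num_docs=1000,
    exp_term=1,
)
```

In this toy example the rare term "crop" outweighs the very frequent term "the", so the query becomes ["gm", "crop"].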

Table 16 summarises the salient features of the submitted and
unsubmitted runs. Table 17 presents the results of our submitted
runs in the blog distillation task of the Blog track. The evaluation
measures in this task are mean average precision (MAP), mean reciprocal
rank (MRR), and precision at rank 10 (P@10).
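
For reference, the per-topic quantities behind these three measures can be computed as in the following sketch; MAP and MRR are then the means of average precision and reciprocal rank over all topics. The function names and the toy ranking are illustrative only, and official figures are normally produced with the standard TREC evaluation tooling.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def precision_at(ranking, relevant, k=10):
    """Fraction of the top k ranked items that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


toy_ranking = ["b1", "b2", "b3", "b4"]
toy_relevant = {"b2", "b4"}
ap = average_precision(toy_ranking, toy_relevant)
rr = reciprocal_rank(toy_ranking, toy_relevant)
p2 = precision_at(toy_ranking, toy_relevant, k=2)
```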


Run Name         Submitted   Salient Features
uogTrBDfe        ✓           PL2F & expCombMNZ
uogTrBDfeN       ✗           + Norm2D
uogTrBDfeNP      ✓           + pBiL2 + Norm2D
uogTrBDfeNPD     ✓           + pBiL2 + Norm2D + Dates
uogTrBDfeNWD     ✓           + Wikipedia + Norm2D + Dates

Table 16: Salient features of our Blog track blog distillation
task submitted and unsubmitted runs.


Run Name         MAP      MRR      P@10
TREC median      0.2224   -        -
uogTrBDfe        0.2028   0.6595   0.3560
uogTrBDfeN       0.2337   0.6948   0.3720
uogTrBDfeNP      0.2395   0.7267   0.3800
uogTrBDfeNPD     0.2437   0.7310   0.3740
uogTrBDfeNWD     0.2521   0.7425   0.4040

Table 17: The mean average precision (MAP), mean reciprocal rank
(MRR), and precision at 10 (P@10) of our Blog track blog distillation
task runs, as well as the median achieved by all participants.
The median MRR and P@10 of all participants are not available.

From the results, we note that applying any of Norm2D, pBiL2
(proximity), Dates or Wikipedia-based collection enrichment results
in an increase in retrieval effectiveness in comparison to the
baseline run. Moreover, the incremental combination of these techniques
brings further improvements, suggesting that each of them
may address a different dimension of the blog distillation problem.

9.  CONCLUSIONS 

In TREC 2008, we have participated in three tracks, namely
the Blog track, the Enterprise track, and the Relevance Feedback
track. In particular, we have investigated the effectiveness of our
proximity-based models in different tasks as well as the usefulness
of external resources for collection and profile enrichment.

In the Blog track opinion-finding task, our main conclusion is
that our proposed approaches have significantly improved over all
but one of the 7 baselines (our proximity-based approaches do not
improve over baseline 4, while the approaches that do not use proximity
do not improve over baseline 2). Moreover, the opinion-finding
performance of our best-performing runs for all but two
baselines has increased across the topics for the three years of the
Blog opinion-finding task, with our best topic-relevance performances
observed for the 2007 topics over all baselines.

In the Blog track blog distillation task, we have shown that all
of our approaches individually improve over the baseline and also
over the median of the participating groups. Additionally, the combination
of these individual approaches improves even further.

In  our  participation  in  the  first  Relevance  Feedback  track,  we 
have  shown  that  query  expansion  on  the  set  of  positive  feedback 
documents  markedly  improves  over  the  first-pass  retrieval  baseline. 
Furthermore,  the  application  of  query  expansion  on  the  document 
surrogates  rather  than  the  raw  documents  has  shown  improved  per¬ 
formance  in  terms  of  Top  10  AR 

For the Enterprise track, we have investigated the application of
suitable external resources. In the document search task, we have
shown that using external resources through collection enrichment
to enhance the retrieval performance is very effective. In particular,
the selective application of collection enrichment according to
the query performance makes a significant improvement in MAP.
Moreover, it is very important to choose an appropriate external resource
and an appropriate query performance predictor before applying
the collection enrichment approach.

In  the  Enterprise  track  expert  search  task,  we  have  successfully 
applied  our  Voting  Model  to  rank  candidate  experts.  Moreover,  we 
have  investigated  the  application  of  candidate  size  normalisation, 
candidate  query-term  proximity,  and  the  enrichment  of  candidate 
profiles  using  a  commercial  Web  search  engine.  Both  candidate 
size  normalisation  and  candidate-query  term  proximity  have  been 
shown  to  improve  the  retrieval  performance. 